src/GFP32/[GFP32]Complete_Analysis.Rmd · MonolayerCultures

---
title: "GFP32_analysis"
author: "Nina-Lydia Kazakou"
date: "19/03/2022"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Set-up

### Load libraries

```{r message=FALSE, warning=FALSE}
library(SingleCellExperiment)
library(Seurat)
library(tidyverse)
library(Matrix)
library(scales)
library(cowplot)
library(RCurl)
library(scDblFinder)
library(Matrix)
library(ggplot2)
library(edgeR)
library(dplyr)
library(ggsci)  
library(here)
library(gtools)
```

### Colour Palette

```{r load_palette}
mypal <- pal_npg("nrc", alpha = 0.7)(10)
mypal2 <-pal_tron("legacy", alpha = 0.7)(7)
mypal3<-pal_lancet("lanonc", alpha = 0.7)(9)
mypal4<-pal_simpsons(palette = c("springfield"), alpha = 0.7)(16)
mypal5 <- pal_rickandmorty(palette = c("schwifty"), alpha = 0.7)(6)
mypal6 <- pal_futurama(palette = c("planetexpress"), alpha = 0.7)(5)
mypal7 <- pal_startrek(palette = c("uniform"), alpha = 0.7)(5)
mycoloursP<- c(mypal, mypal2, mypal3, mypal4)
```

I will try to keep the analysis as similar as possible to the original one; I will not alter object names so that it is easily relatable. 

These data were generated with 10x kits v2 and they were originally analysed in 2018, thus the analysis is quite different to the one I did later on for the RC17 monolayer cultures. Despite this, in the end it is pretty obvious that the cells in both instances behave very similarly. 

# Create srt
```{r}
wk1 <- Read10X(here("data", "CellRanger", "PreOL", "filtered_gene_bc_matrices", "GRCh38"))
wk2 <- Read10X(here("data", "CellRanger", "EarlyOL", "filtered_gene_bc_matrices", "GRCh38"))
wk3 <- Read10X(here("data", "CellRanger", "LateOL", "filtered_gene_bc_matrices", "GRCh38"))
```

```{r}
opc <- CreateSeuratObject(counts = wk1, project = "GFP32_wk1")
earlyOL <- CreateSeuratObject(counts = wk2, project = "GFP32_wk2")
lateOL <- CreateSeuratObject(counts = wk3, project = "GFP32_wk3")
```

```{r}
hOL_merge <- merge(x = opc, y = c(earlyOL, lateOL), add.cell.ids = c("OPC", "earlyOL", "lateOL"), project = "GFP32_merge")
```

### Populate meta.data

```{r}
hOL_merge$log10GenesPerUMI <- log10(hOL_merge$nFeature_RNA) / log10(hOL_merge$nCount_RNA)

chemistry <- rep("v2", ncol(hOL_merge))
hOL_merge <- AddMetaData(hOL_merge, chemistry, col.name = "Chemistry")

hOL_merge[["percent.mito"]] <- PercentageFeatureSet(hOL_merge, pattern = "^MT-")
hOL_merge[["percent.ribo"]] <- PercentageFeatureSet(hOL_merge, pattern = "^RP[SL]")

metadata.filt <- hOL_merge@meta.data

metadata.filt$Cells <- rownames(metadata.filt)

metadata.filt$Sample <- NA
t1 <- grepl("^wk1_", metadata.filt$Cells)
t2 <- grepl("^wk2_", metadata.filt$Cells)
t3 <- grepl("^wk3_", metadata.filt$Cells)

metadata.filt$Sample[which(t1)] <- "wk1"
metadata.filt$Sample[which(t2)] <- "wk2"
metadata.filt$Sample[which(t3)] <- "wk3"

# Add metadata.filt back to Seurat object
hOL_merge@meta.data <- metadata.filt
head(hOL_merge@meta.data, 2)
```

```{r fig.height=5, fig.width=9}
VlnPlot(hOL_merge, features = c("nFeature_RNA", "nCount_RNA", "percent.mito"), pt.size = 0.01, cols = mycoloursP) + NoLegend()
```

```{r fig.height=4, fig.width=10}
plot1 <- FeatureScatter(hOL_merge, feature1 = "nCount_RNA", feature2 = "percent.mito")
plot2 <- FeatureScatter(hOL_merge, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", cols = mycoloursP)
plot1 + plot2
```


# QC and selecting cells for further analysis

A few QC metrics commonly used by the community include:
*The number of unique genes detected in each cell.
  +Low-quality cells or empty droplets will often have very few genes
  +Cell doublets or multiplets may exhibit an aberrantly high gene count
*Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes)
*The percentage of reads that map to the mitochondrial genome
  +Low-quality / dying cells often exhibit extensive mitochondrial contamination
  +We calculate mitochondrial QC metrics with the PercentageFeatureSet() function, which calculates the percentage of counts originating from a set of features
  +We use the set of all genes starting with MT- as a set of mitochondrial genes

I will use the same method and same  thresholds as before. 

```{r}
hOL.filt <- subset(hOL_merge, subset = percent.mito < 10)

high.det.opc <- WhichCells(hOL.filt, expression = nFeature_RNA >7500 & orig.ident == "GFP32_wk1")
high.det.earlyOL <- WhichCells(hOL.filt, expression = nFeature_RNA >5000 & orig.ident == "GFP32_wk2")
high.det.lateOL <- WhichCells(hOL.filt, expression = nFeature_RNA >5500 & orig.ident == "GFP32_wk3")
hOL.filt <- subset(hOL.filt, cells = setdiff(WhichCells(hOL.filt), c(high.det.earlyOL, high.det.lateOL, high.det.opc)))

low.opc <- WhichCells(hOL.filt, expression = nFeature_RNA < 3000 & orig.ident == "GFP32_wk1")
low.early <- WhichCells(hOL.filt, expression = nFeature_RNA < 1500 & orig.ident == "GFP32_wk2")
low.late <- WhichCells(hOL.filt, expression = nFeature_RNA < 2000 & orig.ident == "GFP32_wk3")
hOL.filt <- subset(hOL.filt, cells = setdiff(WhichCells(hOL.filt), c(low.early, low.late, low.opc)))

table(Idents(hOL.filt))

table(Idents(hOL_merge))
```


# Normalisation

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in hOL_merge[["RNA"]]@data.

```{r}
hOL_merge <- NormalizeData(hOL_merge, normalization.method = "LogNormalize", scale.factor = 10000)
```

# Identification of highly variable features (feature selection)

```{r}
hOL_merge <- FindVariableFeatures(hOL_merge, selection.method = "vst", nfeatures = 2000)
```

## Identify the 10 most highly variable genes:

```{r message=FALSE, warning=FALSE}
top10 <- head(VariableFeatures(hOL_merge), 10)
plotA <- VariableFeaturePlot(hOL_merge, cols = mycoloursP[4:5])
plotB <- LabelPoints(plot = plotA, points = top10, repel = TRUE)
plotB
```

# Scale all genes (if there is a batch effect I could add vars to regress here)
Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function:

*Shifts the expression of each gene, so that the mean expression across cells is 0
*Scales the expression of each gene, so that the variance across cells is 1
  +This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
*The results of this are stored in pbmc[["RNA"]]@scale.data

```{r}
all.genes <- rownames(hOL_merge)
hOL_merge <- ScaleData(hOL_merge, features = all.genes)
```

# Linear dimensional reduction

```{r}
hOL_merge <- RunPCA(hOL_merge, features = VariableFeatures(hOL_merge), npcs = 100)
```

```{r}
VizDimLoadings(hOL_merge, dims = 1:2, reduction = "pca")
```


```{r}
DimPlot(hOL_merge, reduction = "pca", cols= mycoloursP[22:50])

DimHeatmap(hOL_merge, dims = 1, cells = 500, balanced = TRUE)
```

```{r fig.height=16, fig.width=10}
DimHeatmap(hOL_merge, dims = 1:20, cells = 500, balanced = TRUE)
```

```{r fig.height=6, fig.width=10}
PCAPlot(hOL_merge, split.by = "orig.ident", cols= mycoloursP[22:50])
```

# Dimentionality Reduction:

```{r}
ElbowPlot(hOL_merge)
```

```{r}
hOL_merge <- FindNeighbors(hOL_merge, dims = 1:15, verbose = FALSE)
hOL_merge <- FindClusters(hOL_merge, resolution = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0), verbose = FALSE)
head(hOL_merge@meta.data, 2)
```

Due to many updates and the quite old kit version used, the data look a bit different, so instead here, I will load the original analysis object used for downstream analysis.

Load original object:
```{r}
GFP32 <- readRDS(here("data", "Processed", "Original_Objects", "GFP32_annotated_srt.rds"))
```

Based on previous analysis the best resolution for these data is **RNA_snn_res.0.5**. The Clusters have been annotated accordingly. In this dataset the oligodendroglia was not subsetted and sub-clusters; It was annotated as part of the whole dataset. 

Here are some plots:
```{r}
Idents(GFP32) <- "Cluster_id"
DimPlot(GFP32, pt.size = 0.8, label = TRUE, cols = mycoloursP) & NoLegend()
```

```{r fig.width=8}
Idents(GFP32) <- "orig.ident"
DimPlot(GFP32, pt.size = 0.8, label = FALSE, cols = mycoloursP[15:40]) 
```


```{r fig.height=5, fig.width=14}
Idents(GFP32) <- "Cluster_id"
DimPlot(GFP32, pt.size = 0.8, label = TRUE, cols = mycoloursP, split.by = "orig.ident", label.size = 3.5) & NoLegend()
```

```{r}
Idents(GFP32) <- "Cluster_id"
DefaultAssay(GFP32) <- "RNA"
```

```{r}
oligodendroglia_markers <- c("OLIG1", "OLIG2", "SOX10", "PDGFRA", "SOX6", "MAG", "MOG", "MBP", "PLP1", "CNP", "MYRF", "GPR17", "NKX2-2", "MYC", "PTGDS")
other_markers <- c("AQP4", "GFAP", "HOPX", "DCX", "MAP2", "SOX2", "SOX9", "SOX11", "PPP1R16B")
```

```{r fig.height=18, fig.width=7}
FeaturePlot(GFP32, 
            features = oligodendroglia_markers, 
            order = TRUE, label = FALSE, ncol = 2)
```

```{r fig.height=12, fig.width=7}
FeaturePlot(GFP32, 
            features = other_markers, 
            order = TRUE, label = FALSE, ncol = 2)
```

```{r}
sessionInfo()
```